Skip to content

Handle single manager reload by having workers reconnect#1775

Merged
mavenugo merged 1 commit into
moby:masterfrom
sanimej:gossip
May 31, 2017
Merged

Handle single manager reload by having workers reconnect#1775
mavenugo merged 1 commit into
moby:masterfrom
sanimej:gossip

Conversation

@sanimej
Copy link
Copy Markdown

@sanimej sanimej commented May 23, 2017

Related to moby #33346

In a cluster with one manager if that manager node is rebooted or daemon is restarted the manager node gets isolated from the gossip cluster. This is because in the single manager case its the one bootstrapping the gossip when other worker nodes connect to it. When that node restarts the libnetwork agent has no other node info to bootstrap the gossip.

Change in this PR is to have the workers try reconnecting to the bootstrapping manager periodically. This uses the already available reconnect logic which was originally added for the worker node, ie: if a worker node is isolated in the gossip cluster because of lets say network congestion when it thaws it will connect to the previously connected nodes to rebuild its gossip state. Normally this won't kick in when a node intentionally leaves the cluster which will be case in this scenario. This change is to treat it similar to gossip disconnect scenario so that the worker nodes can retry.

Signed-off-by: Santhosh Manohar santhosh@docker.com

Comment thread networkdb/delegate.go
// kick in. Because in a single manager cluster manager's gossip can't be
// bootstrapped unless some other node connects to it.
if nEvent.Type == NodeEventTypeLeave {
for _, ip := range nDB.bootStrapIP {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would trigger this logic only in case len(nDB.bootStrapIP) == 1 to cover the specific case, else we ends up to retry to connect also in the case of 2 or more managers

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering making it specific to the one manager case. But in the multiple manager case also workers connecting to a restarting manager should be ok, isn't it ? It will help in that manager node converging faster.

Comment thread networkdb/delegate.go
// time.
nDB.networkClock.Witness(nEvent.LTime)

n := nDB.getNode(nEvent)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can just search if the node is in the nodes list and ignore it if it is in the leftNodes and failedNodes

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently checkAndGetNode gets called in handleNodeEvent. This function checks all the lists. Since we are adding getNode before checkAndGetNode its safer to keep the same search logic. Making it too restrictive (ie: searching only in nDB.nodes) and bailing out early can break some other scenario thats currently working.

Also, in this particular issue its ok even if the node is in the failedNodes. It can happen if there is out of order events and we have already handled NotifyLeave

@mavenugo mavenugo mentioned this pull request May 25, 2017
23 tasks
@fcrisciani
Copy link
Copy Markdown

Discussed yesterday with @sanimej and we concluded that at the moment this logic is the safest approach.
At the moment the delegate does not distinguish if a node is leaving explicitly as a result of a node leave or it just crashed and became unresponsive. For this reason the behavior now for the managers would be to retry them no matter what for 24h. Without this logic it is possible that a rapid reboot of the 3 managers, will let them create a new gossip cluster by themselves completely isolated by the rest of the workers.

@thaJeztah just FYI as a follow up of our previous discussion. This means that after this change you will see that logs for 24h (every 1s for the manager IPs)

LGTM, would be ideal to handle the explicit leave with no retry but this change a part from some log spam won't affect functionality

@sanimej
Copy link
Copy Markdown
Author

sanimej commented May 25, 2017

Unrelated to this PR. But I think retry every second is a bit too aggressive. That can be changed and also rate-limit the log message to one in every few mins or so. Will make that change in a separate PR.

Signed-off-by: Santhosh Manohar <santhosh@docker.com>
@mavenugo
Copy link
Copy Markdown
Contributor

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants